We start by introducing the notation used in the batch prediction setting and then detail the techniques we use for asleep/awake classification. Let the number of electrodes used for any given patient be $E$. As described above, we obtain the power spectral density (PSD) information for each day of patient data; let the total number of frequencies used be $F$. Then, for each patient, the training data is $\mathcal{D} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}\} \subset \mathbb{R}^{EF}$ with corresponding class labels $\mathcal{Y} = \{y^{(1)},\ldots,y^{(n)}\}$, $y^{(i)} \in \{-1,1\}$ (we use $-1$ to denote `asleep' and $1$ to denote `awake'), where $n$ is the number of days of data.
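As a concrete illustration of this layout, the following minimal NumPy sketch (the function name and PSD source are hypothetical; any per-electrode PSD estimate would do) stacks the $F$ PSD values of each of the $E$ electrodes into a single electrode-major feature vector in $\mathbb{R}^{EF}$:
\begin{verbatim}
import numpy as np

# psd: hypothetical array of shape (E, F) holding one day's PSD,
# one row per electrode, one column per frequency bin.
def day_features(psd: np.ndarray) -> np.ndarray:
    # Flatten electrode-major so that entries e*F:(e+1)*F are the
    # F frequency weights of electrode e -- the grouping that the
    # mixed-norm regularizer introduced below relies on.
    return psd.ravel()
\end{verbatim}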

\textbf{Model.} We aim to build a model for binary classification that selects a sparse set of electrodes that accurately predict whether a patient was asleep or awake on a particular day. We wish to model the probability that a patient is awake. A natural choice is to construct a generalized linear model with a logit link function, $p(y^{(i)} = 1 \mid \mathbf{x}^{(i)}; \theta) = \frac{1}{1+e^{-\theta^\top\mathbf{x}^{(i)}}}$, resulting in what is commonly known as logistic regression. Logistic regression maximizes the likelihood of the labels given the data
\begin{align}
\max_\theta \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}; \theta). \nonumber
\end{align}
This can be expressed more conveniently for gradient-based solvers\footnote{For example, http://tinyurl.com/minimize-m} as minimizing the logistic risk
\begin{align}
\min_\theta \sum_{i=1}^{n} \log \big( 1+e^{-y^{(i)}\theta^\top\mathbf{x}^{(i)}} \big).
\end{align}
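To see the equivalence, note that for labels $y^{(i)} \in \{-1,1\}$ the two class probabilities combine into a single expression, so the negative log-likelihood is exactly the risk above:
\begin{align}
p(y^{(i)} \mid \mathbf{x}^{(i)}; \theta) = \frac{1}{1+e^{-y^{(i)}\theta^\top\mathbf{x}^{(i)}}} \quad\Longrightarrow\quad -\log p(y^{(i)} \mid \mathbf{x}^{(i)}; \theta) = \log \big( 1+e^{-y^{(i)}\theta^\top\mathbf{x}^{(i)}} \big). \nonumber
\end{align}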
From an empirical risk minimization standpoint, one idea is to pair the logistic risk with the $\ell_1$-norm regularizer, commonly used to encourage sparsity in $\theta$~\citep{efron2004least,scholkopf2001learning}. This would encourage not only a sparse use of electrodes, but also of frequencies. However, measuring additional frequencies imposes no extra burden on an electrode that is already in use: once a single frequency from an electrode is used for classification, all of its remaining frequency information comes `for free'. Additionally, restricting the number of frequencies may obscure the impact of key brain regions that modulate consciousness. Thus, we propose a model that uses a sparse set of electrodes but, once an electrode is selected by the model, may use all of its corresponding frequencies for prediction.

\textbf{Mixed norm.} To achieve such structured sparsity we make use of the following regularizer
\begin{align}
 \lambda \sum_{e=1}^E \left\| \sum_{f=1}^F (\theta_{(e,f)})^2 \right\|_0,
\end{align}
where the $\ell_0$ is defined for any scalar $a$ as $\|a\|_0 \in \{0,1\}$ with $\|a\|_0 = 1$ if and only if $a \neq 0$. Since $\sum_{f=1}^F (\theta_{(e,f)})^2 = 0$ if and only if every frequency weight of electrode $e$ is zero, an electrode incurs the same cost in the loss no matter how many of its frequencies carry non-zero weight. Unfortunately, the $\ell_0$ norm is discontinuous and non-differentiable. Thus, we relax it using the mixed norm described by \citet{kowalski2009sparse}
\begin{align}
\sum_{e=1}^E \left\| \sum_{f=1}^F (\theta_{(e,f)})^2 \right\|_0 \rightarrow \sum_{e=1}^E \sqrt{ \sum_{f=1}^F (\theta_{(e,f)})^2 }. \nonumber
\end{align}
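For concreteness, a minimal NumPy sketch (the function name and the flat, electrode-major layout of $\theta$ are our illustrative assumptions) comparing the exact electrode count with its mixed-norm relaxation:
\begin{verbatim}
import numpy as np

def electrode_penalties(theta, E, F):
    # Rows of `groups` are the per-electrode weight vectors.
    groups = theta.reshape(E, F)
    sq = (groups ** 2).sum(axis=1)    # g_e(theta) for each electrode
    l0 = np.count_nonzero(sq)         # exact: number of active electrodes
    l12 = np.sqrt(sq).sum()           # relaxation: sum of group l2 norms
    return l0, l12
\end{verbatim}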
Our entire loss is then
\begin{align}
 {\cal L}(\theta) = \sum_{i=1}^{n} \log \big( 1+e^{-y^{(i)}\theta^\top\mathbf{x}^{(i)}} \big) + \lambda \sum_{e=1}^E \sqrt{ \sum_{f=1}^F (\theta_{(e,f)})^2 } \label{eq:batch-loss},
\end{align}
where $\lambda$ controls the strength of the regularization. Eq.~(\ref{eq:batch-loss}) is still non-differentiable (at any point where all weights of an electrode are zero). To solve for the optimal $\theta$ we introduce the following lemma.

\textbf{Lemma 1.} \emph{Given any $g(\theta) > 0$, the following holds:}
\begin{align}
	\sqrt{g(\theta)} = \min_{z>0}\frac{1}{2}\Bigg[\frac{g(\theta)}{z} + z\Bigg]. \label{eq:lemma1}
\end{align}
Note that $z = \sqrt{g(\theta)}$ minimizes the function on the right-hand side of the equation, at which point it equals $\sqrt{g(\theta)}$, proving the lemma. We may create auxiliary variables $z_e$ and functions $g_e(\theta) = \sum_{f=1}^F (\theta_{(e,f)})^2$ for $1 \leq e \leq E$ and substitute the variational form of Eq.~(\ref{eq:lemma1}) for each term of the mixed-norm regularizer, making the objective differentiable in $\theta$:
\begin{align}
\tilde{\cal L}(\theta, z) = \sum_{i=1}^{n} \log \big( 1+e^{-y^{(i)}\theta^\top\mathbf{x}^{(i)}} \big) + \frac{\lambda}{2} \sum_{e=1}^E \Bigg[\frac{g_e(\theta)}{z_e} + z_e\Bigg]. \nonumber
\end{align}
The loss can then be minimized by alternating between fixing $\theta$ and solving for the auxiliary variables in closed form, $z_e = \sqrt{g_e(\theta)}$, and fixing the $z_e$ and solving the resulting smooth problem for $\theta$.
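A minimal NumPy sketch of this alternating scheme (all names, step sizes, and iteration counts are illustrative assumptions; in practice the $\theta$-step can be handed to any smooth solver):
\begin{verbatim}
import numpy as np

def fit_group_sparse(X, y, E, F, lam, outer=20, inner=200,
                     lr=1e-3, eps=1e-8):
    # X: (n, E*F) electrode-major features; y: (n,) labels in {-1, +1}.
    theta = np.zeros(E * F)
    for _ in range(outer):
        # z-step (Lemma 1): closed form, nudged away from zero for stability.
        g = (theta.reshape(E, F) ** 2).sum(axis=1)
        z = np.sqrt(g) + eps
        # theta-step: gradient descent on the smooth surrogate.
        for _ in range(inner):
            m = y * (X @ theta)                         # margins
            s = 1.0 / (1.0 + np.exp(m))                 # -d/dm log(1 + e^{-m})
            grad = -(X * (y * s)[:, None]).sum(axis=0)  # logistic-risk gradient
            grad += lam * (theta.reshape(E, F) / z[:, None]).ravel()
            theta -= lr * grad
    return theta
\end{verbatim}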

\textbf{Frequency characterization.} In order to determine which frequencies are most predictive of whether a patient is awake or asleep, we divide $\mathcal{D}$ into $F$ training sets. For each set $\mathcal{D}_f$ we retain only frequency $f$ of each electrode as features, giving inputs $\mathbf{x}^{(i)}_f \in \mathbb{R}^E$. We then train $F$ separate models $\theta^f \in \mathbb{R}^E$ for $f = 1, \ldots, F$ via $\ell_1$-regularized logistic regression,
\begin{align}
\min_{\theta^f} \sum_{i=1}^{n} \log \big( 1+e^{-y^{(i)}(\theta^f)^\top\mathbf{x}^{(i)}_f} \big) + \lambda \sum_{e=1}^E | \theta^f_e |. \nonumber
\end{align}
The regularization drives each model to use a small subset of all electrodes, and we can then determine the frequency whose model best predicts whether a patient is awake or asleep.
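A sketch of this per-frequency procedure, assuming scikit-learn is available (the function name is ours, and scoring on the training data is purely illustrative; held-out data should be used to pick the best frequency):
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def best_frequency(X, y, E, F, C=1.0):
    # X: (n, E*F) electrode-major features; y: (n,) labels in {-1, +1}.
    # C is the inverse regularization strength, i.e. C ~ 1/lambda.
    Xg = X.reshape(len(X), E, F)
    scores = []
    for f in range(F):
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        clf.fit(Xg[:, :, f], y)            # frequency f only, all electrodes
        scores.append(clf.score(Xg[:, :, f], y))
    return int(np.argmax(scores)), scores
\end{verbatim}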
